Abstract

The problem of assigning probability distributions which objectively 
reflect the prior information available about experiments is one of the ma- 
jor stumbling blocks in the use of Bayesian methods of data analysis. In 
this paper the method of Maximum (relative) Entropy (ME) is used to 
translate the information contained in the known form of the likelihood 
into a prior distribution for Bayesian inference. The argument is inspired 
and guided by intuition gained from the successful use of ME methods 
in statistical mechanics. For experiments that cannot be repeated the re- 
sulting "entropic prior" is formally identical with the Einstein fluctuation 
formula. For repeatable experiments, however, the expected value of the 
entropy of the likelihood turns out to be relevant information that must 
be included in the analysis. The important case of a Gaussian likelihood 
is treated in detail. 

1 Introduction 

The inference of physical quantities from data generated either by experiment or 
by numerical simulation is a ubiquitous and often cumbersome task. Whether 
the data is corrupted by noise, hampered by finite resolution or tied up in cor- 
relations, in principle it should always be possible to improve the analysis by 
taking into account, in addition to the information contained in the data, what- 
ever other knowledge one might have about the physical quantities to be inferred 
or about how the data was generated. The way to link this prior information 
with the new information in the data is found in Bayesian probability theory. 

Bayesian methods are increasingly popular in physics. They are essential
whenever repeating the experiment many times in order to reduce the measure- 
ment uncertainty is either too expensive or time consuming. This is a common 
situation in astronomy and astrophysics [2], and also in large laboratory
experiments as in fusion [3] and in high energy physics. Other typical uses in
physics arise in spectrum restoration, in ill-posed inversion problems,
and when separating a signal from an unknown background. Applications
include mass spectrometry [5], Rutherford backscattering, and nuclear
magnetic resonance.

From a general point of view the problem of inductive inference is to update 
from a prior probability distribution to a posterior distribution when new infor- 
mation becomes available. The challenge is to develop updating methods that 
are systematic and objective. Two methods have been found which are of very 
broad applicability: one is based on Bayes' theorem and the other is based on 
the maximization of entropy. The choice between these two updating methods 
is dictated by the nature of the information being processed. 

When we want to update our beliefs about the values of certain quantities $\theta$
on the basis of the observed values of other quantities $y$ - the data - and of some
known relation between $\theta$ and $y$ we must use Bayes' theorem. The updated or
posterior distribution is $p(\theta|y) \propto \pi(\theta)p(y|\theta)$; the relation between $y$ and
$\theta$ is supplied by a known model $p(y|\theta)$; the previous knowledge about $\theta$ is
codified both into the "prior" probability $\pi(\theta)$ and also in the "likelihood"
distribution $p(y|\theta)$.
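The Bayesian update just described can be illustrated numerically. The following is only a minimal sketch under assumptions that are not part of the paper: a one-dimensional parameter on a uniform grid, a flat prior, and a hypothetical Gaussian model with known width (the names `grid_posterior` and `gaussian_likelihood` are illustrative).

```python
import math

def grid_posterior(thetas, prior, likelihood, y):
    """Posterior on a grid: proportional to prior(theta) * p(y|theta)."""
    unnorm = [pr * likelihood(y, th) for th, pr in zip(thetas, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def gaussian_likelihood(y, theta, sigma=1.0):
    """Hypothetical model p(y|theta): Gaussian with mean theta, known sigma."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    return math.exp(-0.5 * ((y - theta) / sigma) ** 2) / norm

# Assumed setup: theta on a uniform grid over [-5, 5] with a flat prior.
thetas = [-5.0 + 0.01 * i for i in range(1001)]
prior = [1.0 / len(thetas)] * len(thetas)

# Update on a single (made-up) observation y = 1.3.
posterior = grid_posterior(thetas, prior, gaussian_likelihood, y=1.3)
theta_map = thetas[max(range(len(thetas)), key=lambda i: posterior[i])]
print("most probable theta on the grid:", round(theta_map, 2))
```

With a flat prior the posterior simply follows the likelihood; the point of the sections that follow is that a better-motivated prior can be derived from the likelihood itself.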

The selection of the prior $\pi(\theta)$ is a controversial issue which has generated
an enormous literature. The difficulty is that it is not clear how to carry out
an objective translation of our previous beliefs about $\theta$ into a distribution $\pi(\theta)$.
One reasonable attitude is to admit subjectivity and recognize that different 
individuals may start from the same information and legitimately end with 
different translations. In simple cases experience and physical intuition have 
led to a considerable measure of success, but we are often confronted with new 
complex situations involving perhaps parameter spaces of high dimensionality 
where we have neither a previous experience nor a reliable intuition. 

On the other hand, there are special cases where some degree of objectivity 
can be attained. For example, requirements of invariance can go a long way 
towards the complete specification of a prior. Considerable effort has been 
spent seeking an objective characterization of that elusive state of knowledge 
that presumably reflects complete ignorance. Although there are convincing 
arguments against the existence of such non-informative priors [13], the search
has had the merit of suggesting connections with the notion of entropy,
including two proposals for "entropic priors". This brings us to the
second method of processing information. 

Bayes' theorem follows from the product rule for joint probabilities,
$p(y, \theta) = \pi(\theta)p(y|\theta)$, and therefore its applicability is restricted to situations
where assertions concerning the joint values of the data $y$ and the parameters $\theta$ are
meaningful. But there are situations where the available information is of a dif- 
ferent nature and involves assertions about the probabilities themselves. Such 
information, which includes but is not limited to assertions about expected val- 
ues, cannot be processed using Bayes' theorem. 

The method of Maximum Entropy (ME) is designed for updating from a prior
probability distribution to a posterior distribution when the information to be
processed takes the form of a constraint on the family of acceptable posterior
distributions. The early and less satisfactory justification of the ME method
followed from interpreting entropy, through the Shannon axioms, as a measure
of the amount of uncertainty in a probability distribution [18, 19]. Objections
to this approach are that the Shannon axioms refer to probabilities of discrete 
variables, the entropy of continuous variables is not defined, and that the use 
of entropy as the unique measure of uncertainty remained questionable. Other
so-called entropies could be and, indeed, were introduced. Ultimately, the real
problem is that Shannon was not concerned with inductive inference. He was 
not trying to update probability distributions but was instead analyzing the 
capacity of communication channels. Shannon's entropy makes no reference to 
prior distributions. 

Considerations such as these motivated several attempts to justify the ME 
method directly as a method of inductive inference without invoking questionable
measures of uncertainty [21]. The concept of relative entropy is then
introduced as a tool for consistent reasoning which, in the special case of uni- 
form priors, reduces to the usual entropy. There is no need for an interpretation 
in terms of heat, disorder, or uncertainty, or even in terms of an amount of 
information. Perhaps this is the explanation of why the search for the meaning 
of entropy has turned out to be so elusive: strictly, entropy needs no interpre- 
tation. In section 2, as background for the rest of the paper, we present a brief 
outline of one such 'no-interpretation' approach inspired by 

In this paper we use entropic arguments to translate prior information into 
a prior distribution. Rather than seeking a totally non-informative prior, we 
make use of information that we do in fact have. Remarkably, it turns out 
that the very conditions that allow us to contemplate using Bayes' theorem - 
namely, knowledge of a likelihood function, $p(y|\theta)$ - already constitute valuable
prior information. In this sense one can assert that the search for completely 
non-informative priors is misplaced: if we do not know the likelihood, then prior 
distributions are not needed anyway. The prior thus obtained is an "entropic 
prior." The name and the first proposal of a prior of this kind are due to Skilling
for the case of discrete distributions. The generalization to the continuous
case and further elaborations by Rodriguez constitute a second proposal. 

It is essential for the successful use of any prior, and of entropic priors in 
particular, to be aware of what information they contain and, crucially, what 
information they do not contain. No prior can be expected to succeed unless all 
the information relevant to the problem at hand has been taken into account. It 
is quite likely that most practical problems that were encountered with entropic 
priors in the past can be traced to a failure to identify and incorporate all the 
relevant information. 

The information that has, in this paper, been translated into the entropic
prior is that contained in the likelihood. The bare entropic priors discussed here
apply to a situation where all we know about the quantities $\theta$ is that they appear
as parameters in the likelihood $p(y|\theta)$, and nothing else. Generalizations are,
of course, possible. Sometimes we are aware of additional relevant information
beyond what is contained in the likelihood and it can easily be incorporated
into a modified entropic prior. Other times we might be guilty of overlooking
additional information we already have. Indeed, we would not be willing to
spend valuable effort in the determination of a parameter $\theta$ unless we suspected
that knowledge of $\theta$ has important implications elsewhere. Typically we know
something about the physical significance and the physical meaning of $\theta$. It is
clear that in these cases we know considerably more than just that $\theta$ is a
parameter appearing in the likelihood. We might even conceive of several different
experiments, $e = 1, 2, \ldots$, each yielding different sets of data $y_e$ related to $\theta$
by different likelihood functions $p_e(y_e|\theta)$. It is sometimes objected that one's
prior knowledge about $\theta$ should not depend on which experiment one decides
to use to measure it, but this objection is misplaced: the mere fact that $\theta$ is
measurable through one or another experiment is additional information which,
if relevant, should be taken into account.

Another family of problems that can be tackled as a rather straightforward
extension of the ideas described here involves choosing which likelihood
distribution from among several competing candidates is responsible for generating
the data. Indeed, it is clear that any systematic approach to model selection 
requires as a prerequisite the capability to process in an objective way the in- 
formation implicit in each of those likelihoods. Except for some brief remarks 
in the final section, all these further developments, valuable as they might be, 
will be addressed elsewhere. 

Our contribution includes a derivation of an entropic prior (section 3) follow- 
ing the same principles of ME inference that have been successful in statistical 
mechanics. In fact, our whole approach is guided by intuition gained from ap- 
plications of ME to statistical mechanics. Preliminary steps along this direction 
were taken in [24] where a problem with the important case of experiments that
can be indefinitely repeated had already been identified but not fully resolved. 
This problem, re-examined in section 4, is interpreted as a symptom that impor- 
tant relevant information has been overlooked. The complete resolution, which 
hinges on identifying and incorporating this additional information, is given in 
sections 5 and 6. The actual way in which ME is used in the derivation, in 
analogy to standard applications in statistical mechanics, turns out to be im- 
portant because it clarifies what it is that has been derived and how to use it: 
ours is, in effect, a third proposal for an entropic prior. In section 7 we discuss 
in detail the important example of a Gaussian likelihood and finally, in section 
8, we summarize and comment on the differences among the three versions of 
entropic prior and on possible further developments. 

2 The logic behind the ME method 

The goal is to update beliefs about $y \in Y$ which are codified in the prior
probability distribution $m(y)$ to a posterior distribution $p(y)$ when new information
in the form of a constraint becomes available. (The constraints can, but need
not, be linear.) The selection is carried out by ranking the probability
distributions according to increasing preference. One feature we impose on the ranking
scheme is transitivity: if distribution $p_1$ is preferred over distribution $p_2$, and $p_2$
is preferred over $p_3$, then $p_1$ is preferred over $p_3$. Such transitive rankings are
implemented by assigning to each $p(y)$ a real number $S[p]$ called the entropy of
$p$ in such a way that if $p_1$ is preferred over $p_2$, then $S[p_1] > S[p_2]$. The selected
$p$ will be that which maximizes $S[p]$. Thus the method involves entropies which
are real numbers and entropies that should be maximized. These are features
imposed by design; they are dictated by the function that the ME method is
supposed to perform.

Next we determine the functional form of S[p]. This is the rule that defines 
the ranking scheme. The purpose of the rule is to do induction. We want to 
extrapolate, to generalize from those special cases where we know what the 
preferred distribution should be to the much larger number of cases where we 
do not. Thus, in order to be an inductive rule S[p] must have wide applicability; 
we will assume that the same rule applies to all cases. There is no justification 
for this universality except for the usual pragmatic justification of induction: we 
must be inclined to generalize lest we become paralyzed into not generalizing 
at all. But then, we should remain cautious and keep in mind that in many 
instances induction just fails. 

The argument goes as follows. If a general theory exists, then it must
apply to special cases. Furthermore, if in a certain special case the preferred 
distribution is known, then this knowledge can be used to constrain the form 
of S[p]. Finally, if enough special cases are known, then S[p] will be completely 
determined. The known special cases are called the "axioms" of ME. As we 
will see below the axioms reflect the conviction that one should not change 
one's mind frivolously, that whatever was learned in the past is important. The 
chosen posterior distribution should coincide with the prior as closely as possible 
and one should only update those aspects of one's beliefs for which corrective 
new evidence has been supplied. The three axioms are listed below. 

Axiom 1: Locality. Local information has local effects. We do not revise
the relative probabilities $p(y')/p(y)$ with $y$ and $y'$ within a certain domain
$D \subset Y$ unless the newly provided information refers explicitly to the domain
$D$. The power of this axiom stems from the arbitrariness in the choice of $D$.
The consequence of the axiom is that non-overlapping domains of $y$ contribute
additively to the entropy: $S[p] = \int dy\, F(p(y))$ where $F$ is some unknown
function.

Axiom 2: Coordinate invariance. The ranking should not depend on
the system of coordinates. The coordinates that label the points $y$ are
arbitrary; they carry no information. The consequence of this axiom is that
$S[p] = \int dy\, p(y) f(p(y)/m(y))$ involves coordinate invariants such as $dy\, p(y)$ and
$p(y)/m(y)$, where the density $m(y)$ and the function $f$ are, at this point,
unknown.

Next we make a second use of the locality axiom to enforce objectivity. We 
allow domain D to extend over the whole space Y and assert that when there 
is no new information there is no reason to change one's mind. When there 
are no constraints the selected posterior distribution should coincide with the 
prior distribution. This eliminates the arbitrariness in the density $m(y)$: up to
normalization $m(y)$ is the prior distribution.

Axiom 3: Consistency for independent subsystems. When a system is 
composed of independent subsystems it should not matter whether the inference 
procedure treats them separately or jointly. If $y = (y_1, y_2) \in Y = Y_1 \times Y_2$, and
the subsystem priors $m_1(y_1)$ and $m_2(y_2)$ are respectively upgraded to $p_1(y_1)$ and
$p_2(y_2)$, then the prior for the whole system $m_1(y_1)m_2(y_2)$ should be upgraded
to $p_1(y_1)p_2(y_2)$. This axiom restricts the function $f$ to be a logarithm. (The
fact that the logarithm applies also when the subsystems are not independent 
follows from our inductive hypothesis that the ranking scheme has universal 
applicability.) 

The overall consequence of these axioms is that probability distributions
$p(y)$ should be ranked relative to the prior $m(y)$ according to their (relative)
entropy,

$$ S[p, m] = -\int dy\, p(y) \log\frac{p(y)}{m(y)} . \tag{1} $$



The derivation has singled out S[p, m] as the unique entropy to be used in induc- 
tive inference. Other expressions, such as S[m,p], or S[p,m] + S[m,p], or even 
expressions that do not involve the logarithm, may be useful for other purposes, 
but they do not constitute an induction: they are not a generalization from the 
simple cases described in the axioms. 
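How the ranking by relative entropy works in practice can be sketched for a discrete variable. The example below is an illustration under assumptions of our own choosing, not a construction taken from the paper: a six-outcome variable with a uniform prior, a single linear constraint on an expected value, and the hypothetical helper names `relative_entropy` and `me_update`. For a linear constraint the maximizer has the exponential form $p_i \propto m_i e^{\lambda f_i}$, and the multiplier is found here by bisection.

```python
import math

def relative_entropy(p, m):
    """S[p, m] = -sum_i p_i log(p_i / m_i) for discrete distributions."""
    return -sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0)

def me_update(m, f, target, lo=-50.0, hi=50.0, iters=200):
    """Maximize S[p, m] subject to sum_i p_i f_i = target.

    The maximizer is p_i proportional to m_i * exp(lam * f_i); the constrained
    mean is increasing in lam, so lam can be located by bisection.
    """
    def tilted(lam):
        w = [mi * math.exp(lam * fi) for mi, fi in zip(m, f)]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(lam):
        return sum(pi * fi for pi, fi in zip(tilted(lam), f))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return tilted(0.5 * (lo + hi))

# Assumed example: six outcomes, uniform prior, expected value fixed at 4.5.
m = [1.0 / 6.0] * 6
faces = [1, 2, 3, 4, 5, 6]
p = me_update(m, faces, target=4.5)

print("S[m, m] =", relative_entropy(m, m))   # no constraint: the prior is kept
print("S[p, m] =", relative_entropy(p, m))   # constrained maximum
```

Any other distribution satisfying the same constraint ranks lower; for instance, one that splits its weight equally between outcomes 4 and 5 has the same mean but a smaller relative entropy.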

We end this section with two comments on the prior density $m(y)$. First,
$S[p, m]$ may be infinitely negative when $m(y)$ vanishes within some region $D$.
In other words, the ME method confers an overwhelming preference on those
distributions $p(y)$ that vanish whenever $m(y)$ does. Is this a problem? Not
really. A similar "problem" also arises in the context of Bayes' theorem. A
vanishing prior represents a tremendously serious prejudice because no amount
of data to the contrary would allow us to revise it. The solution in both cases
is to recognize that unless we are absolutely certain that $y$ could not possibly
lie within $D$ then we should not have assigned $m(y) = 0$ in the first place.
Assigning a very low but nonzero prior represents a safer and less prejudiced
representation of one's beliefs and/or doubts both in the context of Bayesian
and of ME inference.

Second, choosing the prior density $m(y)$ can be tricky. When there is no
information leading us to prefer one microstate of a physical system over
another we might as well assign equal prior probability to each state. Thus it is
reasonable to identify $m(y)$ with the density of states and the invariant $m(y)dy$
is the number of microstates in $dy$. This is the basis for statistical mechanics.
Other examples of relevance to physics arise when there is no reason to prefer
one region of the space $Y$ over another. Then we should assign the same prior
probability to regions of the same "volume," and we can choose $dy\, m(y)$ to be
the volume of a region $R$ in the space $Y$. Notice that because of the presence of
the prior $m(y)$ not all subjectivity has been eliminated and Laplace's principle
of insufficient reason still plays an important role, albeit in a somewhat
modified form. Just as with Bayes' theorem, what is objective here is the manner in
which information is processed, not the initial probability assignments.







3 Entropic priors: the basic idea 



In this section we follow [24] closely. We use the ME method to derive a prior
$\pi(\theta)$ for use in Bayes' theorem,

$$ p(\theta|y) \propto p(y, \theta) = \pi(\theta)p(y|\theta) . \tag{2} $$

The selection of a preferred distribution using the ME method demands that the
space in which the search will be conducted be specified. Being a consequence of
the product rule for joint probabilities, Bayes' theorem requires that assertions
such as '$y$ and $\theta$' be meaningful and that the 'probability of $y$ and $\theta$' be well
defined. Therefore we must focus our attention on $p(y, \theta)$ rather than $\pi(\theta)$; the
relevant universe of discourse is neither $\Theta$, the space of all $\theta$s, nor the data space
$Y$, but the product $\Theta \times Y$. This point, first made by Rodriguez [22], is central
to the argument. Our derivation and the final result, however, differ from his
in several respects [21, 22].

To rank distributions in the space $\Theta \times Y$ we must decide on a prior $m(y, \theta)$.
At this starting point absolutely nothing is known about the variables $\theta$; in
particular, they have no physical meaning, and no relation between $y$ and $\theta$ is
known. The $\theta$s are totally arbitrary. Therefore the prior must be a product
$m(y)\mu(\theta)$ of the separate priors in the spaces $Y$ and $\Theta$. Indeed, the distribution
that maximizes the relative entropy

$$ \sigma[p] = -\int dy\, d\theta\, p(y, \theta) \log\frac{p(y, \theta)}{m(y)\mu(\theta)} \tag{3} $$

when no constraints are imposed is $p(y, \theta) \propto m(y)\mu(\theta)$; it is such that data
about $y$ tells us absolutely nothing about $\theta$.

In what follows we assume that $m(y)$ is known. We consider this an important
part of understanding what data it is that has been collected. In section 7
we will suggest a reasonable $m(y)$ for the special case of a Gaussian likelihood.
The prior $\mu(\theta)$ remains unspecified.

Next we incorporate the crucial piece of information from which the parameters
$\theta$ derive their physical meaning and which establishes the relation between
$\theta$ and $y$: the likelihood function $p(y|\theta)$ is known. This has two consequences.
First, the joint distribution $p(y, \theta)$ is constrained to be of the form $\pi(\theta)p(y|\theta)$.
Notice that this constraint is not in the form that is most usual for applications
of the ME method: it is not an expectation value. Note also that the only
information we are using about the quantities $\theta$ is that they appear as parameters
in the likelihood $p(y|\theta)$, nothing else. In many situations of experimental
interest there exists additional relevant information beyond what is contained
in the likelihood; such information should be included as additional constraints
in the maximization of the relative entropy $\sigma$.

Second, now that a bare minimum is known about $\theta$, namely that each $\theta$
represents a probability distribution, there is a natural but still subjective choice
for $\mu(\theta)$. As discussed in [22], except for an overall multiplicative constant, there
is a unique Riemannian metric that adequately reflects the fact that the points
in a space of probability distributions are not 'structureless', but happen to
be probability distributions; this is the Fisher-Rao metric. Within the
finite-dimensional subspace defined by the constraint - the known $p(y|\theta)$ - the natural
metric on $\Theta$ is $d\ell^2 = g_{ij}\, d\theta^i d\theta^j$, where the unique $g_{ij}$ induced by the family of
distributions $p(y|\theta)$ is

$$ g_{ij} = \int dy\, p(y|\theta) \frac{\partial \log p(y|\theta)}{\partial\theta^i} \frac{\partial \log p(y|\theta)}{\partial\theta^j} . \tag{4} $$

Accordingly we choose $\mu(\theta) = g^{1/2}(\theta)$, where $g(\theta)$ is the determinant of $g_{ij}$.
Having identified the prior measure and the constraints, we allow the ME method
to take over.

The preferred distribution $p(y, \theta)$ is chosen by varying $\pi(\theta)$ to maximize

$$ \sigma[\pi] = -\int dy\, d\theta\, \pi(\theta)p(y|\theta) \log\frac{\pi(\theta)p(y|\theta)}{g^{1/2}(\theta)\, m(y)}
 = -\int d\theta\, \pi(\theta) \log\frac{\pi(\theta)}{g^{1/2}(\theta)} + \int d\theta\, \pi(\theta) S(\theta) , \tag{5} $$

where $S(\theta)$ is the entropy of the likelihood,

$$ S(\theta) = -\int dy\, p(y|\theta) \log\frac{p(y|\theta)}{m(y)} . \tag{6} $$

Writing the Lagrange multiplier that enforces $\int d\theta\, \pi(\theta) = 1$ as $1 - \log\zeta$, and
assuming $p(y|\theta)$ is normalized yields

$$ 0 = \int d\theta \left( -\log\frac{\pi(\theta)}{g^{1/2}(\theta)} + S(\theta) - \log\zeta \right) \delta\pi(\theta) . \tag{7} $$

Therefore the probability that the value of $\theta$ should lie within the small volume
$g^{1/2}(\theta)d\theta$ is

$$ \pi(\theta)d\theta = \frac{1}{\zeta}\, e^{S(\theta)} g^{1/2}(\theta)d\theta \quad\text{with}\quad \zeta = \int d\theta\, g^{1/2}(\theta)\, e^{S(\theta)} . \tag{8} $$

This entropic prior is our first main result. It tells us that the preferred value of
$\theta$ is that which maximizes the entropy $S(\theta)$ because this maximizes the scalar
probability density $\exp S(\theta)$. It also tells us the degree to which values of $\theta$
away from the maximum are ruled out; in many cases the preference for the
ME distribution can be overwhelming. Note also that the density $\exp S(\theta)$ is
a scalar function and the presence of the Jacobian factor $g^{1/2}(\theta)$ makes eq. (8)
manifestly invariant under changes of the coordinates $\theta$ in the space $\Theta$.
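As a rough numerical illustration of eq. (8), the entropic prior can be tabulated on a grid for a simple family. The choice below is an assumption made only for this sketch and does not come from the paper: a Bernoulli likelihood with $y \in \{0, 1\}$, the counting measure $m(y) = 1$, and a midpoint grid on $0 < \theta < 1$. For this family the Fisher-Rao metric of eq. (4) reduces to $g(\theta) = 1/[\theta(1-\theta)]$.

```python
import math

def entropy_of_likelihood(theta):
    """S(theta) of eq. (6) for a Bernoulli likelihood, assuming m(y) = 1."""
    return -(theta * math.log(theta) + (1.0 - theta) * math.log(1.0 - theta))

def fisher_information(theta):
    """g(theta) of eq. (4) for the one-parameter Bernoulli family."""
    return 1.0 / (theta * (1.0 - theta))

def entropic_prior(thetas, dtheta):
    """Eq. (8) on a grid: pi(theta) = g^(1/2)(theta) exp(S(theta)) / zeta."""
    weights = [math.sqrt(fisher_information(t)) * math.exp(entropy_of_likelihood(t))
               for t in thetas]
    zeta = sum(weights) * dtheta          # midpoint-rule estimate of zeta
    return [w / zeta for w in weights], zeta

n = 2000
dtheta = 1.0 / n
thetas = [(i + 0.5) * dtheta for i in range(n)]   # midpoints avoid theta = 0, 1
prior, zeta = entropic_prior(thetas, dtheta)

# The scalar density exp(S) is largest where the likelihood entropy peaks.
i_max = max(range(n), key=lambda i: entropy_of_likelihood(thetas[i]))
print("exp(S) peaks near theta =", thetas[i_max])
```

In this example the scalar density $\exp S(\theta)$ peaks at the centre of the interval, while the coordinate density also carries the Jacobian factor $g^{1/2}(\theta)$, which grows toward the end points.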

We can claim a partial success. The ingredients that have been used are
precisely those that led us to consider using Bayes' theorem in the first place.
The information contained in the model - by which we mean the data
space $Y$, its measure $m(y)$, and the conditional distribution $p(y|\theta)$ - has been
translated into a prior $\pi(\theta)$. The success is partial because it has been achieved
for the special case of the fixed data space $Y$ of those experiments which cannot
conceivably be repeated. A more complete treatment requires that we address
the important case of experiments that can be repeated indefinitely.

4 Repeatable experiments



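Experiments need not be repeatable but sometimes they are. Let us assume
that successive repetitions are possible and that they happen to be independent.
Suppose, to be specific, that the experiment is performed twice so that the space
of data $Y \times Y = Y^2$ consists of the possible outcomes $y_1$ and $y_2$. Suppose further
that $\theta$ is not a "random" variable; the value of $\theta$ is fixed but unknown. Then
the joint distribution in the space $\Theta \times Y^2$ is

$$ p(y_1, y_2, \theta) = \pi^{(2)}(\theta)p(y_1, y_2|\theta) = \pi^{(2)}(\theta)p(y_1|\theta)p(y_2|\theta) , \tag{9} $$

and the appropriate $\sigma$ entropy is

$$ \sigma^{(2)}[\pi] = -\int dy_1\, dy_2\, d\theta\, p(y_1, y_2, \theta) \log\frac{p(y_1, y_2, \theta)}{[g^{(2)}(\theta)]^{1/2}\, m(y_1)m(y_2)} , \tag{10} $$

where $g^{(2)}(\theta)$ is the determinant of the Fisher-Rao metric for $p(y_1, y_2|\theta)$. From
eq. (4) it follows that $g^{(2)}_{ij} = 2g_{ij}$ so that $g^{(2)}(\theta) = 2^d g(\theta)$, $d$ being the dimension
of $\theta$. Maximizing $\sigma^{(2)}[\pi]$ subject to $\int d\theta\, \pi^{(2)}(\theta) = 1$ we get

$$ \pi^{(2)}(\theta) = \frac{1}{\zeta^{(2)}}\, g^{1/2}(\theta)\, e^{S^{(2)}(\theta)} = \frac{1}{\zeta^{(2)}}\, g^{1/2}(\theta)\, e^{2S(\theta)} , \tag{11} $$

where $S^{(2)}(\theta) = 2S(\theta)$ is the entropy of $p(y_1, y_2|\theta)$, and $S(\theta) = S^{(1)}(\theta)$. The
generalization to $N$ repetitions of the experiment, with data space $Y^N$, is
immediate:

$$ \pi^{(N)}(\theta) = \frac{1}{\zeta^{(N)}}\, g^{1/2}(\theta)\, e^{N S(\theta)} . \tag{12} $$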

This is clearly wrong: the dependence of $\pi^{(N)}$ on the amount $N$ of data would
lead us to a perpetual revision of the prior as more data is collected. The
absurdity of this situation becomes manifest when we consider the case of large
$N$. Then the exponential preference for the value of $\theta$ that maximizes $S(\theta)$
becomes so pronounced that no amount of data to the contrary can successfully
overcome its effect. The data becomes irrelevant, and the more data we have,
the more irrelevant it becomes.
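The sharpening of the naive prior with the number of repetitions can be seen numerically. The sketch below assumes, purely for illustration, a Bernoulli likelihood with $m(y) = 1$ (so that $g(\theta) = 1/[\theta(1-\theta)]$) and evaluates the width of $\pi^{(N)}(\theta) \propto g^{1/2}(\theta)\, e^{N S(\theta)}$ on a midpoint grid; the function name `prior_std` is hypothetical.

```python
import math

def prior_std(n_rep, n_grid=4000):
    """Std. dev. of theta under pi^(N) ~ g^(1/2) exp(N S), Bernoulli example."""
    d = 1.0 / n_grid
    thetas = [(i + 0.5) * d for i in range(n_grid)]
    weights = []
    for t in thetas:
        s = -(t * math.log(t) + (1.0 - t) * math.log(1.0 - t))
        g = 1.0 / (t * (1.0 - t))
        # Subtract the maximum entropy log(2) so the exponent never overflows;
        # the constant cancels on normalization.
        weights.append(math.sqrt(g) * math.exp(n_rep * (s - math.log(2.0))))
    z = sum(weights)
    mean = sum(w * t for w, t in zip(weights, thetas)) / z
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, thetas)) / z
    return math.sqrt(var)

for n_rep in (1, 10, 100, 1000):
    print(n_rep, prior_std(n_rep))
```

In this example the width shrinks steadily as $N$ grows, so that for large $N$ the prior alone would pin $\theta$ down before any data is examined, which is the pathology described above.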

Repeatable experiments present us with a problem. One possible attitude 
is to blame the ME method: it gives nonsense and cannot be trusted. As with 
all inductive methods this is, of course, a logical possibility. A second, more 
constructive approach, is to always be prepared to question the results of ME 
calculations on the basis that there is no guarantee that all the information 
relevant to the situation at hand has been taken into account. The problem 
is not a failure of the ME method but a failure to include all the relevant 
information. 

That this is indeed the case can be seen as follows: When we say an
experiment can be repeated twice, $N = 2$, we actually know more than just
$p(y_1, y_2|\theta) = p(y_1|\theta)p(y_2|\theta)$. We also know that forgetting or discarding the
value of say $y_2$, yields an experiment that is totally indistinguishable from the
single, $N = 1$, experiment. This additional information is quantitatively
expressed by $\int dy_2\, p(y_1, y_2, \theta) = p(y_1, \theta)$, or equivalently

$$ \int dy_2\, \pi^{(2)}(\theta)p(y_1|\theta)p(y_2|\theta) = \pi^{(1)}(\theta)p(y_1|\theta) , \tag{13} $$

which leads to $\pi^{(2)}(\theta) = \pi^{(1)}(\theta)$. In the general case we get the manifestly
reasonable result

$$ \pi^{(N)}(\theta) = \pi^{(N-1)}(\theta) = \ldots = \pi^{(1)}(\theta) . \tag{14} $$

The challenge then is to identify a constraint that codifies this information
within each space $\Theta \times Y^N$.

5 More information: the Lagrange multiplier $\alpha$

The problem with the prior $\pi^{(N)}(\theta)$ in eq. (12) is that it expresses an overwhelming
preference for the value $\theta_{\max}$ of $\theta$ that maximizes the entropy $S(\theta)$. Indeed,
as $N \to \infty$ we have $\pi^{(N)}(\theta) \to \delta(\theta - \theta_{\max})$ leading to

$$ \langle S \rangle = \int d\theta\, \pi^{(N)}(\theta) S(\theta) \to S(\theta_{\max}) , \tag{15} $$

which is manifestly incorrect. This suggests that a better prior would be obtained
by maximizing the entropy $\sigma^{(N)}$ of distributions on the space
$\Theta \times Y^N$ subject to an additional constraint on the numerical value $\bar{S}$ of the
expected entropy $\langle S \rangle$. It is not that we happen to know the numerical value $\bar{S}$
of $\langle S \rangle$. In fact we do not. It is rather that we recognize that information about
$\bar{S}$ is relevant in the sense that if $\bar{S}$ were known the problem above would not
arise. Naturally, additional effort will be required to obtain the needed value of
$\bar{S}$.

The logic of the previous paragraph may sound unfamiliar and further com- 
ments may be helpful. When justifying the use of the ME method to obtain, say, 
the canonical Boltzmann-Gibbs distribution ($P_q \propto e^{-\beta E_q}$) it has been common
to say something like "we seek the minimally biased (i.e. maximum entropy)
distribution that codifies the information we do possess (the expected energy)
and nothing else". Many authors find this justification objectionable. Indeed,
they might argue, for example, that the spectrum of black body radiation is 
what it is independently of whatever information happens to be available to us. 
We prefer to phrase our objection differently: in most realistic situations the 
expected value of the energy is not a quantity we happen to know. Nevertheless, 
it is still true that maximizing entropy subject to a constraint on this (unknown) 
expected energy leads to correct predictions. Therefore, the justification behind 
imposing a constraint on the expected energy cannot be that this is a quantity 
that happens to be known - because it is not - but rather that the expected
energy is the quantity that should be known. Even if unknown, we recognize it
as the crucial relevant information without which no successful predictions can 
be made. Therefore we proceed as if this crucial information were available and 
produce a formalism that contains the temperature as a free parameter that 
will later have to be obtained from the experiment itself. In other words, the 
temperature (or expected energy) is one additional parameter to be inferred 
from the data. 

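The entropy on the space $\Theta \times Y^N$ is

$$ \sigma^{(N)}[\pi] = -\int d\theta\, \pi(\theta) \log\frac{\pi(\theta)}{g^{1/2}(\theta)} + N \int d\theta\, \pi(\theta) S(\theta) , \tag{16} $$

where $S(\theta)$ is given by eq. (6). (A constant factor of $N^{d/2}$ associated with the
Fisher-Rao measure $[g^{(N)}(\theta)]^{1/2}$ has been omitted. It would eventually be absorbed into
the normalization of $\pi(\theta)$.) To obtain the prior $\pi(\theta)$ we maximize $\sigma^{(N)}$ subject
to constraints on $\langle S \rangle$ and that $\pi$ be normalized.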



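$$ \delta\left\{ \sigma^{(N)} + (1 - \log\zeta)\left( \int d\theta\, \pi(\theta) - 1 \right) + \lambda_N \left( \int d\theta\, \pi(\theta)S(\theta) - \bar{S} \right) \right\} = 0 . \tag{17} $$

This gives,

$$ \int d\theta \left( -\log\frac{\pi(\theta)}{g^{1/2}(\theta)} + (N + \lambda_N)S(\theta) - \log\zeta \right) \delta\pi(\theta) = 0 . \tag{18} $$

Therefore,

$$ \pi(\theta) = \frac{1}{\zeta}\, g^{1/2}(\theta) \exp\left[ (N + \lambda_N)S(\theta) \right] . \tag{19} $$

The undesired dependence on $N$ is eliminated if in each space $\Theta \times Y^N$ the
Lagrange multipliers $\lambda_N$ are chosen so that $N + \lambda_N = \alpha$ is a constant independent
of $N$. The resulting entropic prior,

$$ \pi(\theta|\alpha) = \frac{1}{\zeta(\alpha)}\, g^{1/2}(\theta)\, e^{\alpha S(\theta)} , \tag{20} $$

satisfies eq. (14). This is our second main result. The prior $\pi(\theta|\alpha)$ codifies
information contained in the likelihood function, plus information about the
expected value of the entropy of the likelihood implicit in the hyper-parameter
$\alpha$,

$$ \bar{S}(\alpha) = \frac{d}{d\alpha} \log\zeta(\alpha) , \tag{21} $$

where $\zeta(\alpha)$ is given by

$$ \zeta(\alpha) = \int d\theta\, g^{1/2}(\theta)\, e^{\alpha S(\theta)} . \tag{22} $$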
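The relation between eqs. (21) and (22) can be checked numerically: the expected entropy under $\pi(\theta|\alpha)$ should equal the derivative of $\log\zeta(\alpha)$. The sketch below again assumes, for illustration only, a Bernoulli likelihood with $m(y) = 1$ on a midpoint grid, and compares a direct average of $S(\theta)$ with a central finite difference; the value $\alpha = 3$ and the step size are arbitrary choices.

```python
import math

N_GRID = 2000
D_THETA = 1.0 / N_GRID
THETAS = [(i + 0.5) * D_THETA for i in range(N_GRID)]

def entropy(t):
    """S(theta) for the assumed Bernoulli likelihood with m(y) = 1."""
    return -(t * math.log(t) + (1.0 - t) * math.log(1.0 - t))

def sqrt_g(t):
    """Fisher-Rao measure g^(1/2)(theta) for the Bernoulli family."""
    return 1.0 / math.sqrt(t * (1.0 - t))

def log_zeta(alpha):
    """log of eq. (22), evaluated with the midpoint rule."""
    total = sum(sqrt_g(t) * math.exp(alpha * entropy(t)) for t in THETAS)
    return math.log(total * D_THETA)

def expected_entropy(alpha):
    """Average of S(theta) under pi(theta|alpha) of eq. (20)."""
    weights = [sqrt_g(t) * math.exp(alpha * entropy(t)) for t in THETAS]
    return sum(w * entropy(t) for w, t in zip(weights, THETAS)) / sum(weights)

alpha = 3.0
h = 1e-4
finite_diff = (log_zeta(alpha + h) - log_zeta(alpha - h)) / (2.0 * h)
print("direct average of S :", expected_entropy(alpha))
print("d log(zeta)/d alpha :", finite_diff)
```

The two numbers agree up to the finite-difference error, which is the sense in which fixing $\alpha$ is equivalent to fixing the expected likelihood entropy.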
The next and final step is to figure out which $\alpha$ applies to the particular
experimental situation under consideration. The natural way to proceed is to invoke
Bayes' theorem,

$$ p(\alpha, \theta|y^N) = \pi(\alpha)\pi(\theta|\alpha) \frac{p(y^N|\theta)}{p(y^N)} . \tag{23} $$

The choice of a prior $\pi(\alpha)$ for $\alpha$ itself is addressed in the next section. If we
were truly interested in the actual $\alpha$, we could marginalize over $\theta$ to obtain

$$ p(\alpha|y^N) = \int d\theta\, p(\alpha, \theta|y^N) = \frac{\pi(\alpha)}{p(y^N)} \int d\theta\, \pi(\theta|\alpha)p(y^N|\theta) . \tag{24} $$

But our interest in the value of $\alpha$ is only indirect; $\alpha$ is a necessary but
annoying technical complication along the way to the real goal which is inferring $\theta$.
Marginalizing over $\alpha$, we get

$$ p(\theta|y^N) = \int d\alpha\, p(\alpha, \theta|y^N) = \bar{\pi}(\theta) \frac{p(y^N|\theta)}{p(y^N)} , \tag{25} $$

where

$$ \bar{\pi}(\theta) = \int d\alpha\, \pi(\alpha)\pi(\theta|\alpha) . \tag{26} $$

This is the answer we sought: the effective prior for $\theta$, the averaged $\bar{\pi}(\theta)$, is
independent of the actual data $y^N$, as it should be. The last step is the assignment
of $\pi(\alpha)$.



6 An entropic prior for $\alpha$

To remain consistent with the spirit of this paper, namely using ME to obtain 
priors, the prior for a must itself be an entropic prior. The motivation behind 
discussing entropic priors is that we wish to consider information included in 
the likelihood function. Since p{'y\9) refers to 9 but makes no reference to any 
hyper-parameters it is quite clear that a should not be treated like the other 
9s. The relation between a and the data y is indirect: a is related to 0, and 9 
is related to y. Once 9 is given, the data y becomes irrelevant, it contains no 
further information about a. The whole significance of a is derived purely from 
its appearance in 7r(6'|a), ea. H20() . Therefore, the relevant universe of discourse 
is A X with a G A. We focus our attention on the joint distribution 

7r(a, e) = 7r(a)7r(6'|a) . (27) 

and we obtain 7r(a) by maximizing the entropy 

m^^jda d9 .(a, 9) log (28) 

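where $\gamma^{1/2}(\alpha)$ is determined below. Since no reference is made to
repeatable experiments in $Y^N$ there is no need for any further constraints
except for normalization.

The Fisher-Rao measure $\gamma^{1/2}(\alpha)$ in eq. (28) is given by

$$\gamma(\alpha) = \int d\theta\, \pi(\theta|\alpha) \left[\frac{\partial}{\partial\alpha}\log\pi(\theta|\alpha)\right]^2 . \tag{29}$$

Using eqs. (20), (21) and (22) we get

$$\gamma(\alpha) = \int d\theta\, \pi(\theta|\alpha) \left[S(\theta) - \frac{d\log\zeta(\alpha)}{d\alpha}\right]^2 , \tag{30}$$

but

$$\frac{d^2\log\zeta(\alpha)}{d\alpha^2} = \frac{d}{d\alpha}\left[\frac{1}{\zeta(\alpha)}\frac{d\zeta(\alpha)}{d\alpha}\right] = \frac{1}{\zeta(\alpha)}\frac{d^2\zeta(\alpha)}{d\alpha^2} - \left[\frac{1}{\zeta(\alpha)}\frac{d\zeta(\alpha)}{d\alpha}\right]^2 . \tag{31}$$

Therefore,

$$\gamma(\alpha) = \frac{d^2\log\zeta(\alpha)}{d\alpha^2} . \tag{32}$$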



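The interpretation is straightforward: the distance between $\pi(\theta|\alpha)$
and $\pi(\theta|\alpha + d\alpha)$ is given by

$$\gamma^{1/2}(\alpha)\, d\alpha = \Delta S(\alpha)\, d\alpha , \tag{33}$$

where $(\Delta S)^2 = \langle (S - \langle S \rangle)^2 \rangle$ is the variance of
the likelihood entropy that appears in eq. (30). In words, the local entropy
uncertainty $\Delta S$ is the distance per unit change in $\alpha$.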
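
The equality between the Fisher-Rao measure written as a variance, eq. (30), and as the curvature of $\log\zeta$, eq. (32), is easy to test numerically. The sketch below again assumes a hypothetical toy model that is not taken from the text ($\theta \in [0,1]$, $g^{1/2} = 1$, $S(\theta) = \theta$, so that $\zeta(\alpha) = (e^{\alpha}-1)/\alpha$); the function names are illustrative.

```python
import math

def integrate(f, a, b, n=4000):
    """Composite midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def zeta(alpha):
    """Closed form of eq. (22) for the toy model: (e^alpha - 1) / alpha."""
    return math.expm1(alpha) / alpha

def average(f, alpha):
    """Expectation of f(theta) over pi(theta | alpha) = exp(alpha*theta) / zeta."""
    return integrate(lambda t: f(t) * math.exp(alpha * t), 0.0, 1.0) / zeta(alpha)

def gamma_from_variance(alpha):
    """Eq. (30): variance of S(theta) = theta under pi(theta | alpha)."""
    mean_S = average(lambda t: t, alpha)
    return average(lambda t: (t - mean_S) ** 2, alpha)

def gamma_from_curvature(alpha, h=1e-3):
    """Eq. (32): second derivative of log zeta by a central second difference."""
    f = lambda a: math.log(zeta(a))
    return (f(alpha + h) - 2.0 * f(alpha) + f(alpha - h)) / h ** 2

alpha = 1.5
gamma_var = gamma_from_variance(alpha)
gamma_curv = gamma_from_curvature(alpha)
delta_S = math.sqrt(gamma_var)   # local entropy uncertainty, eq. (33)
print(gamma_var, gamma_curv, delta_S)
```

The two estimates of $\gamma(\alpha)$ agree to within the discretization error, and their square root is the distance per unit change in $\alpha$ of eq. (33).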

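To maximize $\Sigma[\pi]$ rewrite it as

$$\Sigma[\pi] = -\int d\alpha\; \pi(\alpha)\, \log\frac{\pi(\alpha)}{\gamma^{1/2}(\alpha)} + \int d\alpha\; \pi(\alpha)\, s(\alpha) , \tag{34}$$

where $s(\alpha)$ is given by

$$s(\alpha) = -\int d\theta\; \pi(\theta|\alpha)\, \log\frac{\pi(\theta|\alpha)}{g^{1/2}(\theta)} = \log\zeta(\alpha) - \alpha\,\frac{d\log\zeta(\alpha)}{d\alpha} . \tag{35}$$

Then, varying with respect to $\pi(\alpha)$ gives

$$\pi(\alpha) = \frac{1}{z}\, \gamma^{1/2}(\alpha)\, e^{s(\alpha)} . \tag{36}$$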



This is our third main result. It completes our derivation of the actual prior for
$\theta$: the averaged $\bar{\pi}(\theta)$ in eq. (26) codifies information
contained in the likelihood function, plus the insight that for repeatable
experiments, information about the expected likelihood entropy, even if
unavailable, is relevant.

We argued above that the hyper-parameter $\alpha$ should not be treated in the
same way as the other parameters $\theta$ because the likelihood $p(y|\theta)$
refers only to the $\theta$s and not to $\alpha$. Nonetheless, it may still be
worthwhile to discuss briefly what would happen if $\alpha$ were treated as one of
the $\theta$s. In this case, the entropic prior $\pi(\alpha)$ would be determined
by focusing our attention on the joint distribution

$$p(\alpha,\theta,y^N) = \pi(\alpha)\,\pi(\theta|\alpha)\,p(y^N|\theta) , \tag{37}$$

where the last two factors on the right are assumed known. The assumed universe
of discourse would be $A \times \Theta \times Y^N$. A straightforward application
of the ME method would, as before, run into trouble with an unwanted $N$
dependence which would require the introduction of a new constraint on the
appropriate expected entropy. Thus, the entropic prior for $\alpha$ would involve
a second hyper-parameter $\alpha_2$. The unknown $\alpha_2$ would itself require
its own entropic prior, involving yet a third hyper-parameter $\alpha_3$, and so
on. There would be an endless chain of hyper-parameters [16]. In any practical
calculation, the chain would have to be truncated. Whether the predictions about
$\theta$ depend on where and how the truncation is carried out remains to be
studied. But, fortunately, this is not necessary: $\alpha$ is not like the other
$\theta$s.



7 Example: a Gaussian model 

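Consider data $y^N = \{y_1, \ldots, y_N\}$ that are scattered around an unknown
value $\mu$,

$$y = \mu + \nu \tag{38}$$

with $\langle \nu \rangle = 0$ and $\langle \nu^2 \rangle = \sigma^2$. The goal is
to estimate the parameters $\theta = (\theta^1, \theta^2) = (\mu, \sigma)$ on the
basis of the data $y^N$ and the information implicit in the model: the data space
$Y$, the measure $m(y)$ (discussed below), and the Gaussian likelihood,

$$p(y|\mu,\sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}}\, \exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right] . \tag{39}$$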



In section 3 we asserted that knowing the measure $m(y)$ is part of knowing what
data has been collected. Therefore, nothing can be said about $m(y)$ without
further specification of the experimental situation. It turns out, however, that
in many physical situations where the data happen to be distributed according to
eq. (39) the underlying space $Y$ is sufficiently symmetric, i.e., invariant
under translations, that we can assume $m(y) = m = \text{constant}$. This is
physically reasonable. Gaussian distributions arise when the measured value of
$y$ is the sum of a large number of "microscopic" contributions and the details
of how the individual contributions are themselves distributed are washed out in
the "macroscopic" sum. The macroscopically relevant features are just those that
distinguish one Gaussian from another, namely, the mean $\mu$ and the variance
$\sigma^2$. This is the physical basis behind the Central Limit Theorem. But if
microscopic details are irrelevant it should be possible to understand the
situation from a purely macroscopic point of view: it should be possible to
obtain the Gaussian distribution as the preferred one among all those with the
given $\mu$ and $\sigma^2$, and this is, indeed, the case: setting
$m(y) = \text{constant}$ in $S[p,m]$, eq. (3), and maximizing subject to
constraints on the mean and variance yields eq. (39).

Using eq. (39), the entropy of the likelihood is

$$S(\mu,\sigma) = \log\frac{\sigma}{\sigma_0} \quad\text{where}\quad \sigma_0 \stackrel{\text{def}}{=} \left(\frac{1}{2\pi e}\right)^{1/2} \frac{1}{m} , \tag{40}$$
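
The closed form in eq. (40) can be reproduced by direct quadrature. The sketch below assumes that the likelihood entropy is the relative entropy $S = -\int dy\, p(y|\mu,\sigma)\log[p(y|\mu,\sigma)/m]$ with a constant measure $m$, which gives $\sigma_0 = 1/(m\sqrt{2\pi e})$; the numerical values of $\mu$, $\sigma$ and $m$ are hypothetical.

```python
import math

def gaussian(y, mu, sigma):
    """Gaussian likelihood, eq. (39)."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood_entropy(mu, sigma, m=1.0, n=20000, width=12.0):
    """S = -int dy p log(p/m), midpoint rule on [mu - width*sigma, mu + width*sigma]."""
    a = mu - width * sigma
    h = 2.0 * width * sigma / n
    total = 0.0
    for i in range(n):
        p = gaussian(a + (i + 0.5) * h, mu, sigma)
        total -= p * math.log(p / m) * h
    return total

def sigma0(m=1.0):
    """Reference scale of eq. (40) for a constant measure m."""
    return 1.0 / (m * math.sqrt(2.0 * math.pi * math.e))

numeric = likelihood_entropy(mu=0.3, sigma=2.0, m=0.5)
closed_form = math.log(2.0 / sigma0(m=0.5))
shifted = likelihood_entropy(mu=-4.0, sigma=2.0, m=0.5)   # same sigma, different mu
print(numeric, closed_form, shifted)
```

The quadrature agrees with $\log(\sigma/\sigma_0)$, and shifting $\mu$ leaves the entropy unchanged.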



and the corresponding Fisher-Rao measure is

$$g(\mu,\sigma) = \det\begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix} = \frac{2}{\sigma^4} . \tag{41}$$

Note that both $S(\mu,\sigma)$ and $g(\mu,\sigma)$ are independent of $\mu$. This
means that if we were concerned with the simpler problem of estimating $\mu$ in a
situation where $\sigma$ happens to be known, then the entropic prior, in any of
the versions obtained above, is a constant independent of $\mu$. In other words,
when $\sigma$ is known, the Bayesian estimate of $\mu$ using entropic priors
coincides with the maximum likelihood estimate, i.e., with the result of the
popular procedure of minimizing

$$\chi^2 = \sum_{i=1}^{N} \frac{(y_i - \mu)^2}{\sigma^2} . \tag{42}$$



Returning to the more interesting case of unknown $\sigma$, the
$\alpha$-dependent entropic prior, eq. (20), is

$$\pi(\mu,\sigma|\alpha) = \frac{2^{1/2}}{\zeta(\alpha)}\, \frac{\sigma^{\alpha-2}}{\sigma_0^{\alpha}} . \tag{43}$$

$\pi(\mu,\sigma|\alpha)$ is improper in both $\mu$ and $\sigma$; normalization
requires the introduction of high and low cutoffs for both $\mu$ and $\sigma$.
The fact that without cutoffs the model is not well defined is an indication that
more relevant information is being requested: the cutoffs constitute relevant
information that must be taken into account. (The logic parallels that which led
to the introduction of $\alpha$ in section 5.) The case of unknown cutoff values
is important and we intend to explore it in detail in future work. The basic idea
is that specifying cutoffs is an integral part of defining the model, and
therefore the choice of cutoffs can be tackled as a problem of model selection.
In the remainder of this section, however, we will assume that the information
about cutoffs is already available.

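It is convenient to write the range of $\mu$ as $\Delta\mu = \mu_H - \mu_L$ and to
define the $\sigma$ cutoffs in terms of dimensionless quantities $\epsilon_L$ and
$\epsilon_H$; $\sigma$ extends from $\sigma_L = \sigma_0\epsilon_L$ to
$\sigma_H = \sigma_0/\epsilon_H$. Then $\zeta(\alpha)$ and
$\pi(\mu,\sigma|\alpha)$ are given by

$$\zeta(\alpha) = \frac{2^{1/2}\,\Delta\mu}{\sigma_0}\; \frac{\epsilon_H^{1-\alpha} - \epsilon_L^{\alpha-1}}{\alpha-1} \tag{44}$$

and

$$\pi(\mu,\sigma|\alpha) = \frac{1}{\Delta\mu\,\sigma}\; \frac{\alpha-1}{\epsilon_H^{1-\alpha} - \epsilon_L^{\alpha-1}} \left(\frac{\sigma}{\sigma_0}\right)^{\alpha-1} . \tag{45}$$

Notice that in the special case of $\alpha = 1$, the prior over $\sigma$ reduces
to $d\sigma/\sigma$, which is called the Jeffreys prior and is usually introduced
by the requirement of invariance under scale transformations,
$\sigma \to \lambda\sigma$.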
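
As a sanity check on eqs. (44) and (45), the following sketch integrates the cutoff-regularized prior over its support and compares it with the Jeffreys form in the limit $\alpha \to 1$. It takes eq. (45) to be $\pi(\mu,\sigma|\alpha) = (\Delta\mu\,\sigma)^{-1}(\alpha-1)(\sigma/\sigma_0)^{\alpha-1}/(\epsilon_H^{1-\alpha}-\epsilon_L^{\alpha-1})$; the cutoff values and function names are hypothetical choices made for illustration.

```python
import math

def prior_sigma(sigma, alpha, sigma0, eps_L, eps_H, delta_mu=1.0):
    """Eq. (45): density in (mu, sigma), valid for alpha != 1."""
    a = alpha - 1.0
    norm = a / (eps_H ** (-a) - eps_L ** a)
    return norm / (delta_mu * sigma) * (sigma / sigma0) ** a

def jeffreys(sigma, eps_L, eps_H, delta_mu=1.0):
    """alpha -> 1 limit of eq. (45): proportional to 1/sigma."""
    return 1.0 / (delta_mu * sigma * math.log(1.0 / (eps_L * eps_H)))

def total_probability(alpha, sigma0, eps_L, eps_H, delta_mu=1.0, n=20000):
    """Integrate eq. (45) over mu (a factor delta_mu) and over sigma in log-space."""
    u_lo = math.log(sigma0 * eps_L)      # log of the lower cutoff sigma_L
    u_hi = math.log(sigma0 / eps_H)      # log of the upper cutoff sigma_H
    h = (u_hi - u_lo) / n
    total = 0.0
    for i in range(n):
        sigma = math.exp(u_lo + (i + 0.5) * h)
        # d(sigma) = sigma * du
        total += prior_sigma(sigma, alpha, sigma0, eps_L, eps_H, delta_mu) * sigma * h
    return delta_mu * total

# Hypothetical cutoffs and scale
sigma0, eps_L, eps_H = 2.0, 0.1, 0.05
norm_below = total_probability(0.3, sigma0, eps_L, eps_H)
norm_above = total_probability(2.5, sigma0, eps_L, eps_H)
near_one = prior_sigma(1.0, 1.0 + 1e-6, sigma0, eps_L, eps_H)
limit = jeffreys(1.0, eps_L, eps_H)
print(norm_below, norm_above, near_one, limit)
```

Both normalizations come out equal to one, and for $\alpha$ close to 1 the density approaches $1/[\Delta\mu\,\sigma\log(1/\epsilon_L\epsilon_H)]$.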

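Writing $\epsilon = (\epsilon_L\epsilon_H)^{1/2}$, the prior for $\alpha$ can be
obtained from eq. (32),

$$\gamma(\alpha) = \frac{1}{(\alpha-1)^2} - \left(\frac{2\log\epsilon}{\epsilon^{1-\alpha} - \epsilon^{\alpha-1}}\right)^2 , \tag{46}$$

and, using eq. (35) for $s(\alpha)$,

$$\pi(\alpha) = \frac{\gamma^{1/2}(\alpha)}{z}\; \frac{\epsilon^{1-\alpha} - \epsilon^{\alpha-1}}{\alpha-1}\; \exp\left[\frac{1}{\alpha-1} + \alpha\, \frac{\epsilon^{1-\alpha} + \epsilon^{\alpha-1}}{\epsilon^{1-\alpha} - \epsilon^{\alpha-1}}\, \log\epsilon\right] , \tag{47}$$

where the normalization $z$ has been suitably redefined.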
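
Eq. (46) can be cross-checked against eq. (32) by differentiating $\log\zeta(\alpha)$ numerically. The sketch below assumes $\zeta(\alpha) = (2^{1/2}\Delta\mu/\sigma_0)(\epsilon_H^{1-\alpha}-\epsilon_L^{\alpha-1})/(\alpha-1)$ for eq. (44) and $\gamma(\alpha) = (\alpha-1)^{-2} - [2\log\epsilon/(\epsilon^{1-\alpha}-\epsilon^{\alpha-1})]^2$ for eq. (46); the cutoff values are hypothetical.

```python
import math

def log_zeta(alpha, eps_L, eps_H, delta_mu=1.0, sigma0=1.0):
    """log of eq. (44); valid for alpha != 1."""
    a = alpha - 1.0
    prefactor = math.sqrt(2.0) * delta_mu / sigma0
    return math.log(prefactor) + math.log((eps_H ** (-a) - eps_L ** a) / a)

def gamma_closed(alpha, eps_L, eps_H):
    """Closed form of eq. (46) with eps = sqrt(eps_L * eps_H)."""
    eps = math.sqrt(eps_L * eps_H)
    a = alpha - 1.0
    return 1.0 / a ** 2 - (2.0 * math.log(eps) / (eps ** (-a) - eps ** a)) ** 2

def gamma_numeric(alpha, eps_L, eps_H, h=1e-4):
    """Eq. (32): second derivative of log zeta by a central second difference."""
    f = lambda x: log_zeta(x, eps_L, eps_H)
    return (f(alpha + h) - 2.0 * f(alpha) + f(alpha - h)) / h ** 2

eps_L, eps_H = 0.1, 0.05   # hypothetical, deliberately unequal cutoffs
checks = [(alpha, gamma_closed(alpha, eps_L, eps_H), gamma_numeric(alpha, eps_L, eps_H))
          for alpha in (0.5, 2.0)]
small_eps_limit = gamma_closed(2.0, 1e-8, 1e-8)   # approaches 1/(alpha-1)^2 = 1
print(checks, small_eps_limit)
```

The agreement for unequal cutoffs illustrates that $\gamma(\alpha)$ depends on $\epsilon_L$ and $\epsilon_H$ only through their product, and the last value shows the approach to $1/(\alpha-1)^2$ as $\epsilon \to 0$.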

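Eqs. (46) and (47) simplify considerably when we take the limit
$\epsilon \to 0$. Clearly the same result is obtained whether we let
$\epsilon_H \to 0$ while keeping $\epsilon_L$ fixed, or let $\epsilon_L \to 0$
while keeping $\epsilon_H$ fixed, or even allow $\epsilon_H$ and
$\epsilon_L \to 0$ simultaneously. The resulting $\gamma(\alpha)$ and
$\pi(\alpha)$ are

$$\gamma(\alpha) = \frac{1}{(\alpha-1)^2} \tag{48}$$

and

$$\pi(\alpha) = \begin{cases} \dfrac{1}{(1-\alpha)^2}\, \exp\left(-\dfrac{1}{1-\alpha}\right) & \text{for } \alpha < 1 \\[2ex] 0 & \text{for } \alpha > 1 \end{cases} \tag{49}$$

where $\pi(\alpha)$ is normalized. This is shown in Fig. 1.

$\pi(\alpha)$ reaches its maximum value at $\alpha = 1/2$. Since
$\pi(\alpha) \sim \alpha^{-2}$ for $\alpha \to -\infty$, the expected value of
$\alpha$ and all higher moments diverge. This suggests that replacing the unknown
$\alpha$ in the prior $\pi(\theta|\alpha)$ by any given numerical value
$\hat{\alpha}$ is probably not a good approximation.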
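
The properties of the limiting prior quoted above can be verified with a few lines of code. The sketch assumes eq. (49) in the form $\pi(\alpha) = (1-\alpha)^{-2}\exp[-1/(1-\alpha)]$ for $\alpha < 1$ and zero otherwise. With the substitution $u = 1/(1-\alpha)$ the probability contained in $[-A, 1)$ is $e^{-1/(1+A)}$, which tends to one only as slowly as $1 - 1/A$; that slow approach is the heavy tail responsible for the divergent moments.

```python
import math

def prior_alpha(alpha):
    """Eq. (49): normalized prior for alpha in the limit eps -> 0."""
    if alpha >= 1.0:
        return 0.0
    t = 1.0 - alpha
    return math.exp(-1.0 / t) / t ** 2

def mass_between(lo, hi, n=200000):
    """Midpoint-rule integral of prior_alpha over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(prior_alpha(lo + (i + 0.5) * h) for i in range(n))

A = 999.0
numeric_mass = mass_between(-A, 1.0)
exact_mass = math.exp(-1.0 / (1.0 + A))       # from u = 1/(1 - alpha)

peak = prior_alpha(0.5)                        # claimed maximum at alpha = 1/2
neighbours = (prior_alpha(0.49), prior_alpha(0.51))

far = -1.0e4
tail_ratio = far ** 2 * prior_alpha(far)       # tends to 1 if pi ~ alpha^(-2)
print(numeric_mass, exact_mass, peak, neighbours, tail_ratio)
```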

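Figure 1: The prior $\pi(\alpha)$ for various values of the cutoff parameter
$\epsilon$, as $\epsilon \to 0$.

As explained in section 5, since $\alpha$ is unknown, the effective prior for
$\theta = (\mu,\sigma)$ is obtained by marginalizing
$\pi(\mu,\sigma,\alpha) = \pi(\mu,\sigma|\alpha)\,\pi(\alpha)$ over $\alpha$,
eq. (26). Since $\pi(\alpha) = 0$ for $\alpha > 1$ as $\epsilon \to 0$ we can
safely take the limit $\epsilon_H \to 0$ or $\sigma_H \to \infty$. Conversely,
since $\pi(\alpha) \neq 0$ for $\alpha < 1$ we cannot take $\epsilon_L \to 0$ or
$\sigma_L \to 0$. The limit $\sigma_H \to \infty$ while keeping $\sigma_L$ fixed
gives,

$$\pi(\mu,\sigma,\alpha) = \begin{cases} \dfrac{1}{\Delta\mu\,\sigma_L\,(1-\alpha)} \left(\dfrac{\sigma}{\sigma_L}\right)^{\alpha-2} \exp\left(-\dfrac{1}{1-\alpha}\right) & \text{for } \alpha < 1 \\[2ex] 0 & \text{for } \alpha > 1 . \end{cases} \tag{50}$$

The averaged prior for $\mu$ and $\sigma$ is

$$\bar{\pi}(\mu,\sigma) = \frac{1}{\Delta\mu\,\sigma_L} \int_{-\infty}^{1} \frac{d\alpha}{1-\alpha} \left(\frac{\sigma}{\sigma_L}\right)^{\alpha-2} \exp\left(-\frac{1}{1-\alpha}\right) , \tag{51}$$

which integrates to

$$\bar{\pi}(\mu,\sigma) = \frac{2}{\Delta\mu\,\sigma}\, K_0\!\left(2\sqrt{\log\frac{\sigma}{\sigma_L}}\right) , \tag{52}$$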



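where $K_0$ is a modified Bessel function of the second kind. This is the
entropic prior for the Gaussian model. The function

$$P(x) = \frac{2}{x}\, K_0\!\left(2\sqrt{\log x}\right) \tag{53}$$

is shown in Fig. 2 as a function of $x = \sigma/\sigma_L$.

$P(x)$ has an integrable singularity as $x \to 1$ where it behaves as

$$P(x) \approx -\frac{2}{x}\left(\log\sqrt{\log x} + \gamma_E\right) \quad\text{for } x \approx 1 , \tag{54}$$

with $\gamma_E$ denoting Euler's constant. Since $\sigma_L$ is a lower cutoff the
region of large $x$ is more relevant. The leading asymptotic behavior is given by

$$P(x) \approx \frac{\pi^{1/2}}{x\,(\log x)^{1/4}}\, \exp\left(-2\sqrt{\log x}\right) \quad\text{for } x \gg 1 . \tag{55}$$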
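
The step from the $\alpha$-integral in eq. (51) to the Bessel-function form in eqs. (52) and (53) can be checked numerically. The sketch below assumes the integrand written in terms of $t = 1-\alpha$, namely $t^{-1}x^{-1-t}e^{-1/t}$ with $x = \sigma/\sigma_L$, and evaluates $K_0$ from the standard integral representation $K_0(z) = \int_0^\infty e^{-z\cosh t}\,dt$ so that only the standard library is needed. The function names and integration limits are illustrative choices.

```python
import math

def midpoint(f, a, b, n):
    """Composite midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def bessel_k0(z, t_max=12.0, n=4000):
    """K_0(z) for z > 0 from K_0(z) = int_0^inf exp(-z cosh t) dt."""
    return midpoint(lambda t: math.exp(-z * math.cosh(t)), 0.0, t_max, n)

def P_closed(x):
    """Eq. (53): P(x) = (2/x) K_0(2 sqrt(log x)), for x > 1."""
    return 2.0 / x * bessel_k0(2.0 * math.sqrt(math.log(x)))

def P_alpha_integral(x, t_max=40.0, n=40000):
    """Eq. (51) in units of 1/(delta_mu * sigma_L), with t = 1 - alpha."""
    log_x = math.log(x)
    return midpoint(lambda t: math.exp(-(1.0 + t) * log_x - 1.0 / t) / t,
                    0.0, t_max, n)

comparison = [(x, P_alpha_integral(x), P_closed(x)) for x in (2.0, 3.0, 10.0)]
for x, from_integral, from_bessel in comparison:
    print(x, from_integral, from_bessel)
```

The upper limit `t_max=40` is adequate for the values of $x$ used here; for $x$ closer to 1 the integrand decays more slowly in $t$ and a larger range would be needed.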

Finally, we turn to Bayes' theorem, eq. (25), with the prior (52) to obtain
estimators for $\mu$ and $\sigma$. For large $N$ the results are independent of
the prior and the estimators coincide with the standard maximum likelihood
results. The case when $N$ is not so large is the more interesting one. As
estimators we can take the expected values $\langle\mu\rangle$ and
$\langle\sigma^2\rangle$ over the posterior (25). The integrations can be
performed numerically and are not particularly illuminating. Alternatively, one
can follow standard practice and marginalize eq. (25) over $\sigma$ to obtain the
distribution $p(\mu|y^N)$ and calculate the estimator $\hat{\mu}$ from

$$\left.\frac{\partial}{\partial\mu}\log p(\mu|y^N)\right|_{\hat{\mu}} = 0 , \tag{56}$$

and its error bar $\hat{\sigma}$ from

$$\left.\frac{\partial^2}{\partial\mu^2}\log p(\mu|y^N)\right|_{\hat{\mu}} = -\frac{1}{\hat{\sigma}^2} . \tag{57}$$

When $p(\mu|y^N)$ happens to be a Gaussian these estimators coincide with the
expected values $\langle\mu\rangle$ and $\langle\sigma^2\rangle$. The final
result for $\hat{\mu}$ is very simple. For any value of $N$ we have

$$\hat{\mu} = \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i ;$$

the estimator $\hat{\mu}$ is the sample average. The result for $\hat{\sigma}$ is
not as elegant but, of course, for large $N$ it asymptotically reduces to
$\hat{\sigma}^2 \approx (\overline{y^2} - \bar{y}^2)/N$.
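
The estimation procedure of eqs. (56) and (57) can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' implementation: it takes the prior of eq. (52) to be $\propto (2/\sigma)K_0(2\sqrt{\log(\sigma/\sigma_L)})$ and independent of $\mu$, uses a hypothetical data set and lower cutoff $\sigma_L$, marginalizes over $\sigma$ on a grid in $u = \log(\sigma/\sigma_L)$, and locates the maximum of $\log p(\mu|y^N)$ by a simple grid search.

```python
import math

def bessel_k0(z, t_max=12.0, n=600):
    """K_0(z) for z > 0 from K_0(z) = int_0^inf exp(-z cosh t) dt (midpoint rule)."""
    h = t_max / n
    return h * sum(math.exp(-z * math.cosh((i + 0.5) * h)) for i in range(n))

def make_log_posterior(data, sigma_L, u_max=8.0, n_sigma=400):
    """Return mu -> log p(mu | y^N) + const, marginalizing sigma with the prior (52).

    With u = log(sigma / sigma_L) the prior measure (2/sigma) K_0 d(sigma)
    becomes 2 K_0(2 sqrt(u)) du, which is tabulated once.
    """
    N = len(data)
    h = u_max / n_sigma
    grid = []
    for i in range(n_sigma):
        u = (i + 0.5) * h
        sigma = sigma_L * math.exp(u)
        grid.append((sigma, 2.0 * bessel_k0(2.0 * math.sqrt(u)) * h))

    def log_posterior(mu):
        Q = sum((y - mu) ** 2 for y in data)
        total = sum(w * sigma ** (-N) * math.exp(-0.5 * Q / sigma ** 2)
                    for sigma, w in grid)
        return math.log(total)

    return log_posterior

data = [4.8, 5.1, 5.6, 4.9, 5.3]          # hypothetical measurements
log_post = make_log_posterior(data, sigma_L=0.05)

# Eq. (56): maximum of log p(mu | y^N) on a grid of step 0.01.
mu_grid = [4.5 + 0.01 * k for k in range(131)]
mu_hat = max(mu_grid, key=log_post)

# Eq. (57): curvature at the maximum by a central second difference.
d = 0.01
curvature = (log_post(mu_hat + d) - 2.0 * log_post(mu_hat) + log_post(mu_hat - d)) / d ** 2
sigma_hat = (-curvature) ** -0.5

N = len(data)
sample_mean = sum(data) / N
large_N_error = math.sqrt((sum(y * y for y in data) / N - sample_mean ** 2) / N)
print(mu_hat, sample_mean, sigma_hat, large_N_error)
```

Because the prior does not depend on $\mu$, the marginal posterior depends on $\mu$ only through $\sum_i (y_i-\mu)^2$, so the grid maximum falls on the sample average. For such a small $N$ the error bar need not coincide with the large-$N$ expression, which is printed only for comparison.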



8 Final remarks 

In this paper the method of maximum relative entropy has been used to translate
the information contained in the known form of the likelihood into a prior
distribution for Bayesian inference. The argument follows closely the analogous
ME methods that have been so successful in statistical mechanics. For
experiments that cannot be repeated the resulting "entropic prior" is formally
identical with the Einstein fluctuation formula. For repeatable experiments,
however, the expected value of the entropy of the likelihood - represented in
terms of a Lagrange multiplier $\alpha$ - turns out to be relevant information
that must be included in the analysis. As an illustration the important case of a
Gaussian likelihood was treated in detail.

It may be useful to comment briefly on the differences between our entropic
prior and the versions previously proposed by Skilling and by Rodriguez. Perhaps
the main difference with Skilling's prior is that, unlike ours, its use is not
restricted to probability distributions but is intended for generic "positive
additive distributions" including, for example, the distributions of intensities
in images. One problem here is that of justifying the applicability of the ME
method in such a general context. Our impulse to generalize is a dangerous one;
we may get away with indulging it occasionally but overindulgence will certainly
lead to error. In any case, our argument in section 3, which consists in
maximizing the entropy $\sigma$ subject to a constraint
$p(y,\theta) = \pi(\theta)\,p(y|\theta)$, makes no sense in the case of generic
positive additive distributions for which there is no available product rule. A
more specific problem arises from the fact that Skilling's entropy is not, in
general, dimensionless and the hyper-parameter $\alpha$ is vaguely interpreted as
some sort of cutoff carrying the appropriate corrective units.
Some of the difficulties, which led Skilling to seek an alternative approach, were 
identified in 

Rodriguez's approach is closer to ours. His prior applies to probability
distributions and appears to be derived from a ME principle [23]. One difference,
perhaps a minor one, is his treatment of the underlying measure $m(y)$. For us
$m(y)$ is not arbitrary; knowing $m(y)$ is part of knowing what data has been
collected. For him $m(y)$ is just an initial guess and he suggests setting
$m(y) = p(y|\theta_0)$ for some value $\theta_0$. The more important difference,
however, is that the number of observed data $n$ is deliberately and explicitly
left unspecified. The space $\Theta \times Y^n$ over which distributions are
defined, and therefore the distributions themselves, also remain unspecified. It
is not clear what the maximization of an entropy over such unspecified spaces
could possibly mean but a hyper-parameter $\alpha$ is eventually introduced and
it is interpreted as a "virtual number of observations supporting the initial
guess $\theta_0$." He proposes that $\alpha$ be considered as one more among the
parameters to be inferred. As mentioned earlier this leads to the introduction of
an endless chain of additional hyper-parameters.

There are several directions in which the ideas of this paper can be further
extended. First, we emphasize once again that the entropic priors discussed here
apply to a situation where all we know about the quantities $\theta$ is that they
appear as parameters in the likelihood $p(y|\theta)$, and nothing else. In many
situations of experimental interest there exists additional relevant information
beyond what is contained in the likelihood. Such information should be included
as additional constraints in the maximization of the relative entropy $\sigma$ in
eq. (17). The resulting modified entropic prior would provide a better
representation of our state of knowledge prior to the acquisition of the data.
Indeed, the advantage of the Bayesian approach over the usual method of maximum
likelihood is the possibility of including additional relevant information by
replacing a flat prior by an appropriately more informative prior. There is
nothing to prevent us from performing a similar improvement and going beyond the
"bare" entropic priors discussed in this paper. Two kinds of additional
information that are easy to include are restrictions on the range of the
parameters $\theta$ and information about the known expected values of some
variables $a(\theta)$. Steps in this direction were taken in section 5, where
$a(\theta)$ is the likelihood entropy, and in section 7 where high and low
cutoffs on the range of the Gaussian parameters were introduced.

Second, in the introduction we mentioned the interesting possibility of analyzing
data $y_e$ from different experiments, $e = 1, 2, \ldots$, related to $\theta$ by
different likelihood functions $p_e(y_e|\theta)$. Clearly this can be analyzed as
a single combined experiment with likelihood
$p(y_1, y_2, \ldots|\theta) = p_1(y_1|\theta)\,p_2(y_2|\theta)\cdots$ to which
all our previous results apply. As we stated earlier, the mere fact that $\theta$
is measurable through one or another experiment is additional relevant
information that can be taken into account.

Third, we also mentioned that problems of model selection can be tackled as an
extension of the ideas described in this paper. On the basis of data $y$ we want
to select one model among several competing candidates labeled by
$m = 1, 2, \ldots$ with likelihood distributions given by $p(y|m,\theta_m)$. The
answer, i.e., the probability of model $m$ given the data $y$, is given by Bayes'
theorem,



$$p(m|y) = \int d\theta_m\, p(m,\theta_m|y) = \frac{1}{p(y)} \int d\theta_m\, \pi(m,\theta_m)\, p(y|m,\theta_m) .$$

This is exact. The problem is solved, at least in principle, once an entropic
prior $\pi(m,\theta_m)$ is assigned. However, the remaining practical problems
associated with carrying out the actual numerical calculations could, of course,
still be quite formidable.

Finally, we end with a word of caution. As in all instances of inductive 
inference there is the possibility that predictions based on the ME method could 
be wrong because not all the information relevant to the problem at hand was 
taken into account. This potential problem is not peculiar to the ME method; it
is a problem shared by all methods of induction. Nevertheless, we are confident
that the rewards of extending the benefits of an inductive method singled out
by requirements of objectivity, the ME method, beyond its traditional territory
of statistical mechanics and into that of data analysis will be enormous.